Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A visual exploration of Gaussian processes. Distill, 4(4). doi:10.23915/distill.00017
Deisenroth, M., Luo, Y., & van der Wilk, M. (2020). A practical guide to Gaussian processes. https://infallible-thompson-49de36.netlify.app/.
Q: How do we choose hyperparameters \(\theta = \left(\ell, \sigma_f, \dots\right)\)?
A: Find \(\theta\) that makes the observed data \(\mathbf{y}\) most probable:
Marginal likelihood
\[p(\mathbf{y} | \mathbf{X}, \theta) = \int p(\mathbf{y} | \mathbf{f}, \mathbf{X}) p(\mathbf{f} | \mathbf{X}, \theta) d\mathbf{f}\]
This “marginalizes” over all possible functions \(\mathbf{f}\).
For the GP model, this has a closed form:
\[\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2\mathbf{I}| + \text{const}\]
This has two competing terms – data fit vs. model complexity.
Consider fitting a sine wave with noise. How would you choose the RBF \(\ell\)?
Very small \(\ell\) (overfitting)
Very large \(\ell\) (underfitting)
The right choice of \(\ell\) will be flexible enough to fit the sine curve but not the noise. The marginal likelihood finds this automatically without any cross-validation.
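To see the data-fit vs. complexity trade-off concretely, here is a minimal numpy sketch (in Python rather than the R used later in these notes) that evaluates the closed-form log marginal likelihood for a noisy sine wave at three RBF lengthscales; the specific values and data are illustrative assumptions.

```python
import numpy as np

def rbf(x1, x2, ell, sigma_f):
    """Squared-exponential (RBF) kernel matrix."""
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell)**2)

def log_marginal_likelihood(x, y, ell, sigma_f, sigma_n):
    """Closed-form GP log marginal likelihood log p(y | X, theta)."""
    n = len(x)
    K = rbf(x, x, ell, sigma_f) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)                    # K + s2*I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                     # data-fit term
            - np.sum(np.log(np.diag(L)))        # complexity term: -0.5*log|K|
            - 0.5 * n * np.log(2 * np.pi))      # constant

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)   # noisy sine

for ell in [0.01, 1.0, 100.0]:
    print(ell, log_marginal_likelihood(x, y, ell, sigma_f=1.0, sigma_n=0.1))
```

An intermediate lengthscale wins: the tiny \(\ell\) pays a complexity penalty for explaining noise, and the huge \(\ell\) pays a data-fit penalty for missing the sine structure.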
Observation = Signal + Noise
\[\begin{align*} y = f(x) + \epsilon \end{align*}\]
Variance Decomposition
\[\begin{align*} \underbrace{\operatorname{Var}[y]}_{\text {what we see }}=\underbrace{\sigma_f^2}_{\text {signal }}+\underbrace{\sigma_n^2}_{\text {noise }} \end{align*}\]
Heuristic 1: If the instrument’s precision is known (e.g., Kepler telescope noise \(\sigma_n \approx 10^{-4}\)), then set \[\begin{align*} \sigma_f^2=\operatorname{Var}[y]-\sigma_n^2 \end{align*}\]
Heuristic 2: Suppose the signal dominates by some assumed SNR factor \(\kappa\) (e.g., 2–100). Then set \[\begin{align*} \sigma_{f}^{2} &\approx \operatorname{Var}[y]\\ \sigma_{n}^{2} &\approx \frac{\sigma_f^2}{\kappa^2} \end{align*}\]
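The two variance heuristics are one-liners; a small Python sketch with an illustrative noisy signal (the noise level and SNR factor are assumptions, not values from the case study):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, 10, 200)) + 0.05 * rng.standard_normal(200)

# Heuristic 1: instrument precision known (hypothetical value)
sigma_n = 0.05
sigma_f2 = np.var(y) - sigma_n**2

# Heuristic 2: assume a signal-to-noise ratio kappa
kappa = 10.0
sigma_f2_snr = np.var(y)
sigma_n2_snr = sigma_f2_snr / kappa**2
```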
Lengthscale controls how far we travel before \(f(x)\) and \(f(x')\) decorrelate
Heuristic: Set lengthscale relative to spread of input data \[\ell \approx \lambda \cdot \text{SD}[\mathbf{X}], \quad \lambda \in [0.2, 10]\]
Alternative: Use median distance from mean \[\ell \approx \text{median}\{|x_i - \bar{x}|\}\]
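Both lengthscale heuristics can be sketched in a few lines of Python; the toy input grid and the choice \(\lambda = 1\) are assumptions for illustration:

```python
import numpy as np

x = np.linspace(0, 100, 500)           # toy 1-D inputs

# Heuristic: lengthscale proportional to the spread of the inputs
lam = 1.0                              # lambda in [0.2, 10], assumed here
ell_sd = lam * np.std(x)

# Alternative: median absolute distance from the mean
ell_med = np.median(np.abs(x - np.mean(x)))
```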
There seem to be two periodicities that dominate.
\[k_{\text{rot}} = k_{\text{periodic}} \times k_{\text{decay}}\]
Physical interpretation:
The original source used a slightly more involved periodic kernel, though with a similar domain-informed kernel construction.
This is how we can implement the kernel in R.
k_periodic <- periodic(period = exp(logperiod),
lengthscale = 1,
variance = exp(logamp))
k_decay <- mat52(lengthscale = exp(logdecay),
variance = 1)
full_kernel <- k_periodic * k_decay + white(exp(logs2))
Hyperparameters
logperiod: \(\log p\) (rotation period)
logdecay: \(\log \ell_{\text{decay}}\) (spot lifetime)
logamp: \(\log \sigma_f^2\) (signal variance)
logs2: \(\log \sigma_n^2\) (noise variance)
Rationale
logperiod ~ N(log(25), 2): Centered at periodogram guess, but wide
logdecay ~ N(log(75), 2): Lasts ~3 rotation periods
logamp ~ N(log(Var[y]), 2): Signal variance ~ data variance
logs2 ~ N(-1, 2): Relatively small noise
This helps us find a mode of the posterior as a starting point for MCMC.
Each draw here is a plausible function \(f(t)\) consistent with data and kernel
project evaluates the GP posterior distribution at the specified time_grid.
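What such an evaluation involves can be sketched generically in numpy (this is a stand-in for the standard GP predictive equations, not the case-study `project` function or its API; the RBF kernel and data are illustrative assumptions):

```python
import numpy as np

def rbf(a, b, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel matrix between two sets of inputs."""
    d = a[:, None] - b[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell)**2)

def posterior(x_train, y_train, time_grid, sigma_n=0.1):
    """GP posterior mean and covariance evaluated on time_grid."""
    K = rbf(x_train, x_train) + sigma_n**2 * np.eye(len(x_train))
    Ks = rbf(time_grid, x_train)                 # cross-covariance
    Kss = rbf(time_grid, time_grid)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha                            # posterior mean
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V                          # posterior covariance
    return mean, cov

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)
grid = np.linspace(0, 10, 100)
mu, cov = posterior(x, y, grid)
# each draw is a plausible function f(t) consistent with data and kernel
samples = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(100), size=5)
```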
Observations
The kernel is a hypothesis about the data generating process.
What we learned
The GP is not just fitting curves, it’s extracting astrophysics.
Problem: Computing \(\left(\mathbf{K} + \sigma_n^2 \mathbf{I}\right)^{-1}\) on a computer.
The condition number is the ratio of the largest to the smallest eigenvalue of this matrix:
\[\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\]
Problem: Large \(\kappa\) → numerical instability in matrix inversion
For GPs, we can bound \(\kappa \leq N\alpha^2 + 1\) where \(\alpha = \frac{\sigma_f}{\sigma_n}\).
Paradoxically, high-quality data and larger sample sizes create numerical problems!
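A quick numpy check of the bound, using an assumed setup with many points and a small noise level to mimic high-quality data:

```python
import numpy as np

def rbf(x, ell, sigma_f):
    """Squared-exponential kernel matrix on a 1-D grid."""
    d = x[:, None] - x[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell)**2)

N, sigma_f, sigma_n = 100, 1.0, 0.01     # tiny noise: alpha = 100
x = np.linspace(0, 1, N)
A = rbf(x, ell=1.0, sigma_f=sigma_f) + sigma_n**2 * np.eye(N)

eig = np.linalg.eigvalsh(A)              # eigenvalues, ascending
kappa = eig[-1] / eig[0]                 # condition number
bound = N * (sigma_f / sigma_n)**2 + 1   # N*alpha^2 + 1
print(kappa, bound)
```

The bound follows because \(\lambda_{\max}(\mathbf{K}) \leq \operatorname{tr}(\mathbf{K}) = N\sigma_f^2\) while \(\lambda_{\min}\) is at least \(\sigma_n^2\).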
When does \(\mathbf{K} + \sigma_n^2\mathbf{I}\) become nearly singular?
Scenario 1: Nearly identical rows
Scenario 2: Noise-free limit
Heuristic. Add small “jitter” to the diagonal
\[\begin{align*} \mathbf{K} + \sigma_n^2\mathbf{I} \to \mathbf{K} + (\sigma_n^2 + \epsilon^2)\mathbf{I} \end{align*}\]
Typically \(\epsilon \approx 10^{-6}\)–\(10^{-4}\).
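A minimal numpy sketch of the jitter trick, combining both failure scenarios (nearly identical inputs and the noise-free limit, both assumed for illustration): without the diagonal bump the Cholesky factorization can fail, with it the factorization goes through.

```python
import numpy as np

# Two nearly identical inputs make K nearly singular
x = np.array([0.0, 1e-9, 1.0])
d = x[:, None] - x[None, :]
K = np.exp(-0.5 * d**2)            # RBF with ell = sigma_f = 1
sigma_n2 = 0.0                     # Scenario 2: noise-free limit

eps2 = 1e-6                        # jitter added to the diagonal
A = K + (sigma_n2 + eps2) * np.eye(3)
L = np.linalg.cholesky(A)          # succeeds with jitter
```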
Input standardization \[\tilde{x} = \frac{x - \bar{x}}{\text{SD}[x]}\]
Also have to rescale \(\ell \to \ell / \text{SD}[x]\)
Output centering \[\tilde{y} = y - \bar{y}\]
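Both preprocessing steps in a short Python sketch (the toy times and fluxes are made up; only the transformations matter):

```python
import numpy as np

x = np.array([1000.0, 1010.0, 1025.0, 1040.0])   # e.g. times in days
y = np.array([5.2, 5.9, 4.8, 5.5])               # e.g. flux

x_tilde = (x - x.mean()) / x.std()   # input standardization
y_tilde = y - y.mean()               # output centering

# A lengthscale of ell in original units becomes ell / SD[x]
ell_days = 30.0
ell_tilde = ell_days / x.std()
```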
We used this in the stellar variability study.
See 03-prep.ipynb.
GPs do well on all five… deep learning fails on 2–5.
Linear regression
Gaussian processes
Both GPs and linear regression are interpretable, but in different senses.
Limitations:
Extensions:
[GP Missing Data] Rerun the case study notebook with a window of observations missing. How does it affect the posterior on the period? What do the posterior predictive samples look like?